Decoding strategies for syntax-based statistical machine translation
Author
Abstract
Translation is the task of transforming text from one language into another. Given a sentence in an input language, a human translator produces a sentence in the desired target language. Advances in artificial intelligence in the 1950s led to the idea of using machines instead of humans to generate translations, and from this idea the field of Machine Translation (MT) emerged. The first MT systems aimed to map input text to its target translation by applying hand-crafted rules. While this approach worked well for specific language pairs in restricted domains, it was hard to extend to new languages and domains because of the large amount of human effort needed to create new translation rules. Increased computational power enabled Statistical Machine Translation (SMT) in the late 1980s, which addressed this problem by learning translation units automatically from large text collections. SMT systems can be divided into several paradigms according to the form of the (automatically learned) units used during translation. Early systems modeled translation between words. Later work extended these units from single words to sequences of words called phrases. Word-based and phrase-based SMT share one property: translation proceeds sequentially. This left-to-right process is poorly suited to language pairs in which several words must be reordered over (potentially) long distances. Such reorderings, which occur in many language pairs (e.g. English-German, English-Chinese or English-Arabic), led to SMT systems based on formalisms that translate recursively instead of sequentially. In these systems, called syntax-based systems, the (automatically learned) translation units are modeled as formal grammar productions, and translation is performed by assembling the productions of these grammars.
Many different grammar formalisms have been developed to model translation. One of the first is the Synchronous Context-Free Grammar (SCFG) which is an extension of the well-known Context-Free Grammar (CFG). To overcome several drawbacks of SCFG, more powerful formalisms have been explored such as the
Similar resources
NiuTrans: An Open Source Toolkit for Phrase-based and Syntax-based Machine Translation
We present a new open source toolkit for phrase-based and syntax-based machine translation. The toolkit supports several state-of-the-art models developed in statistical machine translation, including the phrase-based model, the hierarchical phrase-based model, and various syntax-based models. The key innovation provided by the toolkit is that the decoder can work with various grammars and offers...
Improving Statistical Machine Translation using Lexicalized Rule Selection
This paper proposes a novel lexicalized approach for rule selection for syntax-based statistical machine translation (SMT). We build maximum entropy (MaxEnt) models which combine rich context information for selecting translation rules during decoding. We successfully integrate the MaxEnt-based rule selection models into the state-of-the-art syntax-based SMT model. Experiments show that our lex...
A new model for persian multi-part words edition based on statistical machine translation
In English, multi-part words are hyphenated, with the hyphen separating the parts. Persian also has multi-part words. According to Persian morphology, a half-space character is needed to separate the parts of a multi-part word, but in many cases people incorrectly use a space character instead. This common incorrect use of the space leads to some s...
Maximum Entropy based Rule Selection Model for Syntax-based Statistical Machine Translation
This paper proposes a novel maximum entropy based rule selection (MERS) model for syntax-based statistical machine translation (SMT). The MERS model combines local contextual information around rules and information of sub-trees covered by variables in rules. Therefore, our model allows the decoder to perform context-dependent rule selection during decoding. We incorporate the MERS model into a...
Cohesive Phrase-Based Decoding for Statistical Machine Translation
Phrase-based decoding produces state-of-the-art translations with no regard for syntax. We add syntax to this process with a cohesion constraint based on a dependency tree for the source sentence. The constraint allows the decoder to employ arbitrary, non-syntactic phrases, but ensures that those phrases are translated in an order that respects the source tree’s structure. In this way, we target...
Hierarchical phrase-based translation with weighted finite state transducers
This dissertation focuses on Statistical Machine Translation (SMT), particularly on hierarchical phrase-based translation frameworks. We first study and redesign hierarchical models using several filtering techniques. Hierarchical search spaces are based on automatically extracted translation rules. As originally defined, they are too big to handle directly without filtering. In thi...